ABSTRACT

Many data sets contain rich information about objects, as 
well as pairwise relations between them. For instance, in 
networks of websites, scientific papers, and other documents, 
each node has content consisting of a collection of words, as 
well as hyperlinks or citations to other nodes. In order to 
perform inference on such data sets, and make predictions 
and recommendations, it is useful to have models that are 
able to capture the processes which generate the text at 
each node and the links between them. In this paper, we 
combine classic ideas in topic modeling with a variant of 
the mixed-membership block model recently developed in 
the statistical physics community. The resulting model has 
the advantage that its parameters, including the mixture 
of topics of each document and the resulting overlapping 
communities, can be inferred with a simple and scalable 
expectation-maximization algorithm. We test our model on 
three data sets, performing unsupervised topic classification 
and link prediction. For both tasks, our model outperforms 
several existing state-of-the-art methods, achieving higher 
accuracy with significantly less computation, analyzing a 
data set with 1.3 million words and 44 thousand links in 
a few minutes. 
1. INTRODUCTION

Many modern data sets contain not only rich information 
about each object, but also pairwise relationships between 
them, forming networks where each object is a node and 
links represent the relationships. In document networks, for 
example, each node is a document containing a sequence of 
words, and the links between nodes are citations or hyper- 
links. Both the content of the documents and the topology 
of the links between them are meaningful. 



Over the past few years, two disparate communities have 
been approaching these data sets from different points of 
view. In the data mining community, the goal has been to 
augment traditional approaches to learning and data mining 
by including relations between objects [15], [33]: for instance,
to use the links between documents to help us label them 
by topic. In the network community, including its subset in 
statistical physics, the goal has been to augment traditional 
community structure algorithms such as the stochastic block 
model [14], [20], [30] by taking node attributes into account: for
instance, to use the content of documents, rather than just 
the topological links between them, to help us understand 
their community structure. 

In the original stochastic block model, each node has a discrete
label, assigning it to one of $k$ communities. These labels, and the
$k \times k$ matrix of probabilities with which a given pair of nodes
with a given pair of labels have a link between them, can be inferred
using Monte Carlo algorithms (e.g. [26]) or, more efficiently, with
belief propagation [12], [11] or pseudolikelihood approaches [?].
However, in real networks communities often overlap, and a given node
can belong to multiple communities. This led to the mixed-membership
block model [1], where the goal is to infer, for each node $v$, a
distribution or mixture of labels $\theta_v$ describing to what extent
it belongs to each community. If we assume that links are assortative,
i.e., that nodes are more likely to link to others in the same
community, then the probability of a link between two nodes $v$ and
$v'$ depends on some measure of similarity (say, the inner product) of
$\theta_v$ and $\theta_{v'}$.

These mixed-membership block models fit nicely with classic ideas in
topic modeling. In models such as Probabilistic Latent Semantic
Analysis (PLSA) [19] and Latent Dirichlet Allocation (LDA) [4], each
document $d$ has a mixture $\theta_d$ of topics. Each topic corresponds
in turn to a probability distribution over words, and each word in $d$
is generated independently from the resulting mixture of
distributions. If we think of $\theta_d$ as both the mixture of topics
for generating words and the mixture of communities for generating
links, then we can infer $\{\theta_d\}$ jointly from the documents'
content and the presence or absence of links between them.



There are many possible such models, and we are far from the first to
think along these lines. Our innovation is to take as our starting
point a particular mixed-membership block model recently developed in
the statistical physics community [2], which we refer to as the BKN
model. It differs from the mixed-membership stochastic block model
(MMSB) of [1] in several ways:

1. The BKN model treats the community membership mixtures $\theta_d$
directly as parameters to be inferred. In contrast, MMSB treats
$\theta_d$ as hidden variables generated by a Dirichlet distribution,
and infers the hyperparameters of that distribution. The situation
between PLSA and LDA is similar: PLSA infers the topic mixtures
$\theta_d$, while LDA generates them from a Dirichlet distribution.

2. The MMSB model generates each link according to a Bernoulli
distribution, with an extra parameter for sparsity. Instead, BKN
treats the links as a random multigraph, where the number of links
$A_{dd'}$ between each pair of nodes is Poisson-distributed. As a
result, the derivatives of the log-likelihood with respect to
$\theta_d$ and the other parameters are particularly simple.

These two factors make it possible to fit the BKN model using an
efficient and exact expectation-maximization (EM) algorithm, making
its inference highly scalable. The BKN model has another advantage as
well:

3. The BKN model is degree-corrected, in that it takes the 
observed degrees of the nodes into account when com- 
puting the expected number of edges between them. 
Thus it recognizes that two documents that have very 
different degrees might in fact have the same mix of 
topics; one may simply be more popular than the other. 

In our work, we use a slight variant of the BKN model to generate the
links, and we use PLSA to generate the text. We present an EM
algorithm for inferring the topic mixtures and other parameters.
(While we do not impose a Dirichlet prior on the topic mixtures, it is
easy to add a corresponding term to the update equations.) Our
algorithm is scalable in the sense that each iteration takes
$O(K(N + M + R))$ time for networks with $K$ topics, $N$ documents,
and $M$ links, where $R$ is the sum over documents of the number of
distinct words appearing in each one. In practice, our EM algorithm
converges within a small number of iterations, making the total
running time linear in the size of the corpus.

Our model can be used for a variety of learning and generalization
tasks, including document classification or link prediction. For
document classification, we can obtain hard labels for each document
by taking its most-likely topic with respect to $\theta_d$, and
optionally improve these labels further with local search. For link
prediction, we train the model using a subset of the links, and then
ask it to rank the remaining pairs of documents according to the
probability of a link between them. For each task we determine the
optimal relative weight of the content vs. the link information.

We performed experiments on three real-world data sets, 
with thousands of documents and millions of words. Our 
results show that our algorithm is more accurate, and con- 
siderably faster, than previous techniques for both document 
classification and link prediction. 



The rest of the paper is organized as follows. Section 2 describes our
generative model, and compares it with related models in the
literature. Section 3 gives our EM algorithm and analyzes its running
time. Section 4 contains our experimental results for document
classification and link prediction, comparing our accuracy and running
time with other techniques. In Section 5 we conclude, and offer some
directions for further work.

2. OUR MODEL AND PREVIOUS WORK 

In this section, we give our proposed model, which we call the Poisson
mixed-topic link model (PMTLM), and its degree-corrected variant
PMTLM-DC.

2.1 The Generative Model 

Consider a network of $N$ documents. Each document $d$ has a fixed
length $L_d$, and consists of a string of words $w_{d\ell}$ for
$1 \le \ell \le L_d$, where $1 \le w_{d\ell} \le W$ and $W$ is the
number of distinct words. In addition, each pair of documents $d, d'$
has an integer number of links connecting them, giving an adjacency
matrix $A_{dd'}$. There are $K$ topics, which play the dual role of
the overlapping communities in the network.

Our model generates both the content $\{w_{d\ell}\}$ and the links
$\{A_{dd'}\}$ as follows. We generate the content using the PLSA model
[19]. Each topic $z$ is associated with a probability distribution
$\beta_z$ over words, and each document has a probability distribution
$\theta_d$ over topics. For each document $1 \le d \le N$ and each
$1 \le \ell \le L_d$, we independently

1. choose a topic $z = z_{d\ell} \sim \mathrm{Multi}(\theta_d)$, and

2. choose the word $w_{d\ell} \sim \mathrm{Multi}(\beta_z)$.

Thus the total probability that $w_{d\ell}$ is a given word $w$ is

$$\Pr[w_{d\ell} = w] = \sum_{z=1}^{K} \theta_{dz} \beta_{zw} . \tag{1}$$

We assume that the number of topics $K$ is fixed. The distributions
$\beta_z$ and $\theta_d$ are parameters to be inferred.
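The two-step generative process and Eq. (1) can be sketched in a few lines. This is only an illustration under assumed toy values: the function names, the two-topic, three-word parameters, and the random seed are hypothetical and do not come from the paper.

```python
import random

def word_probability(theta_d, beta, w):
    """Pr[w_dl = w] = sum_z theta_dz * beta_zw, as in Eq. (1)."""
    return sum(theta_d[z] * beta[z][w] for z in range(len(theta_d)))

def sample_document(theta_d, beta, length, rng):
    """Draw `length` words: pick a topic from theta_d, then a word from beta[z]."""
    topics = list(range(len(theta_d)))
    vocab = list(range(len(beta[0])))
    words = []
    for _ in range(length):
        z = rng.choices(topics, weights=theta_d)[0]
        words.append(rng.choices(vocab, weights=beta[z])[0])
    return words

# Hypothetical toy parameters: K = 2 topics, W = 3 words.
theta_d = [0.75, 0.25]
beta = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5]]
probs = [word_probability(theta_d, beta, w) for w in range(3)]
doc = sample_document(theta_d, beta, 10, random.Random(0))
```

Because each word is drawn independently, the marginal word distribution of a document is the mixture in `probs`, which sums to one over the vocabulary.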

We generate the links using a version of the Ball-Karrer-Newman (BKN)
model [2]. Each topic $z$ is associated with a link density $\eta_z$.
For each pair of documents $d, d'$ and each topic $z$, we
independently generate a number of links which is Poisson-distributed
with mean $\theta_{dz}\theta_{d'z}\eta_z$. Since the sum of
independent Poisson variables is Poisson, the total number of links
between $d$ and $d'$ is distributed as

$$A_{dd'} \sim \mathrm{Poi}\Big(\sum_z \theta_{dz}\theta_{d'z}\eta_z\Big) . \tag{2}$$

Since $A_{dd'}$ can exceed 1, this gives a random multigraph. In the
data sets we study below, $A_{dd'}$ is 1 or 0 depending on whether $d$
cites $d'$, giving a simple graph. On the other hand, in the sparse
case the event that $A_{dd'} > 1$ has low probability in our model.
Moreover, the fact that $A_{dd'}$ is Poisson-distributed rather than
Bernoulli makes the derivatives of the likelihood with respect to the
parameters $\theta_{dz}$ and $\eta_z$ very simple, allowing us to
write down an efficient EM algorithm for inferring them.
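To make Eq. (2) concrete, the sketch below computes the Poisson mean for two pairs of documents and the probability of a multi-edge in a sparse setting. The topic mixtures and link densities are hypothetical values chosen for illustration.

```python
import math

def link_mean(theta_d, theta_e, eta):
    """Poisson mean of A_dd' in Eq. (2): sum_z theta_dz * theta_d'z * eta_z."""
    return sum(t1 * t2 * e for t1, t2, e in zip(theta_d, theta_e, eta))

def poisson_pmf(k, mean):
    """Probability that a Poisson variable with the given mean equals k."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

theta_a = [0.9, 0.1]
theta_b = [0.8, 0.2]
theta_c = [0.1, 0.9]
eta = [0.05, 0.02]   # hypothetical link densities, one per topic

mean_ab = link_mean(theta_a, theta_b, eta)   # similar topic mixtures
mean_ac = link_mean(theta_a, theta_c, eta)   # dissimilar topic mixtures
# Probability of two or more links between a and b.
p_multi = 1.0 - poisson_pmf(0, mean_ab) - poisson_pmf(1, mean_ab)
```

With means this small, `p_multi` is of order `mean_ab ** 2 / 2`, which is why treating a simple graph as a Poisson multigraph costs little in the sparse case.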




Figure 1: Graphical models for link generation: (a) LINK-LDA,
(b) C-PLDC, (c) RTM, (d) PMTLM-DC.

This version of the model assumes that links are assortative, i.e.,
that links between documents only form to the extent that they belong
to the same topic. One can easily generalize the model to include
disassortative links as well, replacing $\eta_z$ with a matrix
$\eta_{zz'}$ that allows documents with distinct topics $z, z'$ to
link [2].

We also consider degree-corrected versions of this model, where in
addition to its topic mixture $\theta_d$, each document has a
propensity $S_d$ of forming links. In that case,

$$A_{dd'} \sim \mathrm{Poi}\Big(S_d S_{d'} \sum_z \theta_{dz}\theta_{d'z}\eta_z\Big) . \tag{3}$$

We call this variant the Poisson Mixed-Topic Link Model with Degree
Correction (PMTLM-DC).
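A small sketch of Eq. (3) shows how the propensities rescale the expected number of links without changing the topic mixtures. The propensity values and parameters are hypothetical.

```python
def link_mean_dc(s_d, s_e, theta_d, theta_e, eta):
    """Poisson mean in Eq. (3): S_d * S_d' * sum_z theta_dz * theta_d'z * eta_z."""
    overlap = sum(t1 * t2 * e for t1, t2, e in zip(theta_d, theta_e, eta))
    return s_d * s_e * overlap

theta = [0.5, 0.5]   # both documents share the same topic mixture
eta = [0.1, 0.1]     # hypothetical link densities
popular = link_mean_dc(4.0, 1.0, theta, theta, eta)
obscure = link_mean_dc(1.0, 1.0, theta, theta, eta)
```

Here the two pairs have identical topic overlap, and the fourfold difference in expected links comes only from the propensity of the more popular document.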

2.2 Prior Work on Content-Link Models 

Most models for document networks generate content using either PLSA
[19], as we do, or LDA [4]. The distinction is that PLSA treats the
document mixtures $\theta_d$ as parameters, while in LDA they are
hidden variables, integrated over a Dirichlet distribution. As we show
in Section 3, our approach gives a simple, exact EM algorithm,
avoiding the need for sampling or variational methods. While we do not
impose a Dirichlet prior on $\theta_d$ in this paper, it is easy to
add a corresponding term to the update equations for the EM algorithm,
with no loss of efficiency.

There are a variety of methods in the literature to generate links
between documents. PHITS-PLSA [10], LINK-LDA [?], and LINK-PLSA-LDA
[27] use the PHITS [9] model for link generation. PHITS treats each
document as an additional term in the vocabulary, so two documents are
similar if they link to the same documents. This is analogous to a
mixture model for networks studied in [28]. In contrast, block models
like ours treat documents as similar if they link to similar
documents, as opposed to literally the same ones.

The PAIRWISE LINK-LDA model [57], like ours, generates the links with
a mixed-topic block model, although as in MMSB [1] and LDA [4] it
treats the $\theta_d$ as hidden variables integrated over a Dirichlet
prior. They fit their model with a variational method that requires
$N^2$ parameters, making it less scalable than our approach.

In the C-PLDC model [32], the link probability from $d$ to $d'$ is
determined by their topic mixtures $\theta_d$, $\theta_{d'}$ and the
popularity $t_{d'}$ of $d'$, which is drawn from a Gamma distribution
with hyperparameters $a$ and $b$. Thus $t_{d'}$ plays a role similar
to the degree-correcting parameter $S_{d'}$ in our model, although we
correct for the degree of $d$ as well. However, C-PLDC does not
generate the content, but takes it as given.

The Relational Topic Model (RTM) [5], [6] assumes that the link
probability between $d$ and $d'$ depends on the topics of the words
appearing in their text. In contrast, our model uses the underlying
topic mixtures $\theta_d$ to generate both the content and the links.
Like our model, RTM defines the similarity of two documents as a
weighted inner product of their topic mixtures; however, in RTM the
probability of a link is a nonlinear function of this similarity,
which can be logistic, exponential, or normal.

Although it deals with a slightly different kind of data set, our
model is closest in spirit to the Latent Topic Hypertext Model (LTHM)
[18]. This is a generative model for hypertext networks, where each
link from $d$ to $d'$ is associated with a specific word $w$ in $d$.
If we sum over all words in $d$, the total number of links $A_{dd'}$
from $d$ to $d'$ that LTHM would generate follows a binomial
distribution

$$A_{dd'} \sim \mathrm{Bin}\Big(L_d,\ \lambda_{d'} \sum_z \theta_{dz}\theta_{d'z}\Big) , \tag{4}$$

where $\lambda_{d'}$ is, in our terms, a degree-correction parameter.
When $L_d$ is large this becomes a Poisson distribution with mean
$L_d \lambda_{d'} \sum_z \theta_{dz}\theta_{d'z}$. Our model differs
from this in two ways: our parameters $\eta_z$ give a link density
associated with each topic $z$, and our degree correction $S_d$ does
not assume that the number of links from $d$ is proportional to its
length.
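The binomial-to-Poisson limit invoked for Eq. (4) can be checked numerically. The sketch below compares the two distributions for an assumed document length and per-word link probability; both numbers are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials of probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(k, mean):
    """Probability that a Poisson variable with the given mean equals k."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

L_d = 1000   # hypothetical document length
p = 0.002    # hypothetical value of lambda_d' * sum_z theta_dz * theta_d'z

# Largest pointwise gap between Bin(L_d, p) and Poi(L_d * p) over small k.
max_gap = max(abs(binom_pmf(k, L_d, p) - poisson_pmf(k, L_d * p))
              for k in range(10))
```

For a long document and a small per-word probability the gap is tiny, so the two link models differ mainly in how the mean is parameterized.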

We briefly mention several other approaches. The authors of [?] extend
the probabilistic relational model (PRM) framework and propose a
unified generative model for both content and links in a relational
structure. In [24], the authors propose a link-based model that
describes both node attributes and links. The HTM model [31] treats
links as fixed rather than generating them, and only generates the
text. Finally, the LMMG model [22] treats the appearance or absence of
a word as a binary attribute of each document, and uses a logistic or
exponential function of these attributes to determine the link
probabilities.

In Section 4 below, we compare our model to PHITS-PLSA, LINK-LDA,
C-PLDC, and RTM. Graphical models for the link generation components
of these models, and ours, are shown in Figure 1.

3. A SCALABLE EM ALGORITHM 

Here we describe an efficient Expectation-Maximization algorithm to
find the maximum-likelihood estimates of the parameters of our model.
Each update takes $O(K(N + M + R))$ time for a document network with
$K$ topics, $N$ documents, and $M$ links, where $R$ is the sum over
the documents of the number of distinct words in each one. Thus the
running time per iteration is linear in the size of the corpus.

For simplicity we describe the algorithm for the simpler version of
our model, PMTLM. The algorithm for the degree-corrected version,
PMTLM-DC, is similar.



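3.1 The Likelihood

Let $C_{dw}$ denote the number of times a word $w$ appears in document
$d$. From (1), the log-likelihood of $d$'s content is

$$\mathcal{L}_d^{\text{content}} = \log P(w_{d1}, \ldots, w_{dL_d} \mid \theta_d, \beta) = \sum_{w=1}^{W} C_{dw} \log\Big(\sum_{z=1}^{K} \theta_{dz}\beta_{zw}\Big) . \tag{5}$$

Similarly, from (2), the log-likelihood for the links $A_{dd'}$ is

$$\mathcal{L}^{\text{links}} = \log P(A \mid \theta, \eta) = \frac{1}{2} \sum_{dd'} A_{dd'} \log\Big(\sum_z \theta_{dz}\theta_{d'z}\eta_z\Big) - \frac{1}{2} \sum_{dd'} \sum_z \theta_{dz}\theta_{d'z}\eta_z . \tag{6}$$

We ignore the constant term $-\sum_{dd'} \log A_{dd'}!$ from the
denominator of the Poisson distribution, since it has no bearing on
the parameters.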
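As a sanity check on Eqs. (5) and (6), the following sketch evaluates both log-likelihoods on a tiny example. It assumes word counts stored as dictionaries and a dense symmetric adjacency matrix, and it assumes the sum over pairs in Eq. (6) runs over all ordered pairs including $d = d'$; the dense double loop is for readability only and does not reflect the sparse implementation described later.

```python
import math

def content_loglik(counts_d, theta_d, beta):
    """Eq. (5): sum_w C_dw * log(sum_z theta_dz * beta_zw); counts_d maps w -> C_dw."""
    total = 0.0
    for w, c in counts_d.items():
        total += c * math.log(sum(t * b[w] for t, b in zip(theta_d, beta)))
    return total

def links_loglik(A, theta, eta):
    """Eq. (6): (1/2) sum_{d,d'} [A_dd' log(mu_dd') - mu_dd'], dropping log A_dd'!."""
    n = len(theta)
    total = 0.0
    for d in range(n):
        for e in range(n):   # assumption: ordered pairs, including d == e
            mu = sum(theta[d][z] * theta[e][z] * eta[z] for z in range(len(eta)))
            if A[d][e] > 0:
                total += A[d][e] * math.log(mu)
            total -= mu
    return 0.5 * total

# Hypothetical toy data: two documents, two topics, two words, one link.
theta = [[0.5, 0.5], [0.5, 0.5]]
beta = [[0.5, 0.5], [1.0, 0.0]]
eta = [0.2, 0.2]
A = [[0, 1], [1, 0]]
ll_content = content_loglik({0: 2, 1: 1}, theta[0], beta)
ll_links = links_loglik(A, theta, eta)
```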



3.2 Balancing Content and Links 

While we can use the total likelihood
$\sum_d \mathcal{L}_d^{\text{content}} + \mathcal{L}^{\text{links}}$
directly, in practice we can improve our performance significantly by
better balancing the information in the content vs. that in the links.
In particular, the log-likelihood $\mathcal{L}_d^{\text{content}}$ of
each document is proportional to its length, while its contribution to
$\mathcal{L}^{\text{links}}$ is proportional to its degree. Since a
typical document has many more words than links,
$\mathcal{L}^{\text{content}}$ tends to be much larger than
$\mathcal{L}^{\text{links}}$.

Following [?], we can provide this balance in two ways. One is to
normalize $\mathcal{L}_d^{\text{content}}$ by the length $L_d$, and
another is to add a parameter $\alpha$ that reweights the relative
contributions of the two terms $\mathcal{L}^{\text{content}}$ and
$\mathcal{L}^{\text{links}}$. We then maximize the function

$$\mathcal{L} = \alpha \sum_d \frac{1}{L_d} \mathcal{L}_d^{\text{content}} + (1 - \alpha)\, \mathcal{L}^{\text{links}} . \tag{7}$$

Varying $\alpha$ from 0 to 1 lets us interpolate between two extremes:
studying the document network purely in terms of its topology, or
purely in terms of the documents' content. Indeed, we will see in
Section 4 that the optimal value of $\alpha$ depends on which task we
are performing: closer to 0 for link prediction, and closer to 1 for
topic classification.
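The reweighted objective in Eq. (7) is a one-line combination of the two log-likelihoods. The sketch below uses made-up log-likelihood values purely to show the interpolation between the two extremes.

```python
def balanced_objective(content_logliks, lengths, links_loglik, alpha):
    """Eq. (7): alpha * sum_d L_d^content / L_d + (1 - alpha) * L^links."""
    content = sum(ll / n for ll, n in zip(content_logliks, lengths))
    return alpha * content + (1.0 - alpha) * links_loglik

content_logliks = [-120.0, -30.0]   # hypothetical per-document content log-likelihoods
lengths = [60, 10]                  # document lengths L_d
links = -8.0                        # hypothetical link log-likelihood

only_links = balanced_objective(content_logliks, lengths, links, 0.0)
only_content = balanced_objective(content_logliks, lengths, links, 1.0)
mixed = balanced_objective(content_logliks, lengths, links, 0.5)
```

Dividing by `lengths` keeps a long document from dominating the content term, so that `alpha` alone controls the trade-off.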



3.3 Update Equations and Running Time 

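We maximize $\mathcal{L}$ as a function of $\{\theta, \beta, \eta\}$
using an EM algorithm, very similar to the one introduced by [2] for
overlapping community detection. We start with a standard trick to
change the log of a sum into a sum of logs, writing

$$\mathcal{L}_d^{\text{content}} \ge \sum_{w=1}^{W} C_{dw} \sum_{z=1}^{K} h_{dw}(z) \log \frac{\theta_{dz}\beta_{zw}}{h_{dw}(z)} ,$$

$$\mathcal{L}^{\text{links}} \ge \frac{1}{2} \sum_{dd'} \sum_{z=1}^{K} \Big[ A_{dd'}\, q_{dd'}(z) \log \frac{\theta_{dz}\theta_{d'z}\eta_z}{q_{dd'}(z)} - \theta_{dz}\theta_{d'z}\eta_z \Big] . \tag{8}$$

Here $h_{dw}(z)$ is the probability that a given appearance of $w$ in
$d$ is due to topic $z$, and $q_{dd'}(z)$ is the probability that a
given link between $d$ and $d'$ is due to topic $z$. This lower bound
holds with equality when

$$h_{dw}(z) = \frac{\theta_{dz}\beta_{zw}}{\sum_{z'} \theta_{dz'}\beta_{z'w}} , \qquad q_{dd'}(z) = \frac{\theta_{dz}\theta_{d'z}\eta_z}{\sum_{z'} \theta_{dz'}\theta_{d'z'}\eta_{z'}} , \tag{9}$$

giving us the E step of the algorithm.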
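The E step in Eq. (9) normalizes per-topic weights for each word occurrence and each link. A minimal sketch, with hypothetical parameter values:

```python
def e_step_word(theta_d, beta, w):
    """h_dw(z) from Eq. (9): proportional to theta_dz * beta_zw."""
    weights = [t * b[w] for t, b in zip(theta_d, beta)]
    total = sum(weights)
    return [x / total for x in weights]

def e_step_link(theta_d, theta_e, eta):
    """q_dd'(z) from Eq. (9): proportional to theta_dz * theta_d'z * eta_z."""
    weights = [a * b * e for a, b, e in zip(theta_d, theta_e, eta)]
    total = sum(weights)
    return [x / total for x in weights]

theta_d = [0.5, 0.5]
theta_e = [0.8, 0.2]
beta = [[0.6, 0.4], [0.2, 0.8]]
eta = [0.1, 0.1]
h = e_step_word(theta_d, beta, 0)        # topic responsibilities for word 0 in d
q = e_step_link(theta_d, theta_e, eta)   # topic responsibilities for a link d-d'
```

Each call touches only $K$ numbers, which is what makes the per-iteration cost proportional to the number of distinct word occurrences and links.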

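For the M step, we derive update equations for the parameters
$\{\theta, \beta, \eta\}$. By taking derivatives of the log-likelihood
(7) (see the Appendix for details) we obtain

$$\eta_z = \frac{\sum_{dd'} A_{dd'}\, q_{dd'}(z)}{\big(\sum_d \theta_{dz}\big)^2} , \tag{10}$$

$$\beta_{zw} = \frac{\sum_d (1/L_d)\, C_{dw}\, h_{dw}(z)}{\sum_d (1/L_d) \sum_{w'} C_{dw'}\, h_{dw'}(z)} , \tag{11}$$

$$\theta_{dz} = \frac{(\alpha/L_d) \sum_w C_{dw}\, h_{dw}(z) + (1 - \alpha) \sum_{d'} A_{dd'}\, q_{dd'}(z)}{\alpha + (1 - \alpha)\, \kappa_d} . \tag{12}$$

Here $\kappa_d = \sum_{d'} A_{dd'}$ is the degree of document $d$.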
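Putting Eqs. (9) through (12) together, one EM iteration might look like the sketch below. It is an illustration under several assumptions that are not from the paper: word counts and the symmetric adjacency structure are stored as dictionaries, the denominator of Eq. (10) uses the topic mixtures from the previous iteration, and all parameters are strictly positive so no normalizer vanishes. The toy corpus is hypothetical.

```python
def em_iteration(counts, adj, theta, beta, eta, alpha):
    """One E step (Eq. 9) followed by one M step (Eqs. 10-12)."""
    N, K, W = len(counts), len(eta), len(beta[0])
    lengths = [sum(c.values()) for c in counts]
    eta_num = [0.0] * K
    beta_num = [[0.0] * W for _ in range(K)]
    theta_num = [[0.0] * K for _ in range(N)]
    for d in range(N):
        # Content: responsibilities h_dw(z) for each distinct word in d.
        for w, c in counts[d].items():
            h = [theta[d][z] * beta[z][w] for z in range(K)]
            norm = sum(h)
            for z in range(K):
                r = c * h[z] / norm / lengths[d]     # (1/L_d) C_dw h_dw(z)
                beta_num[z][w] += r
                theta_num[d][z] += alpha * r
        # Links: responsibilities q_dd'(z) for each neighbour d' of d.
        for e, a in adj[d].items():
            q = [theta[d][z] * theta[e][z] * eta[z] for z in range(K)]
            norm = sum(q)
            for z in range(K):
                r = a * q[z] / norm                  # A_dd' q_dd'(z)
                eta_num[z] += r
                theta_num[d][z] += (1.0 - alpha) * r
    # Assumption: Eq. (10) is evaluated with the previous theta.
    theta_sum = [sum(theta[d][z] for d in range(N)) for z in range(K)]
    new_eta = [eta_num[z] / theta_sum[z] ** 2 for z in range(K)]     # Eq. (10)
    new_beta = [[x / sum(row) for x in row] for row in beta_num]     # Eq. (11)
    degree = [sum(adj[d].values()) for d in range(N)]
    new_theta = [[x / (alpha + (1.0 - alpha) * degree[d]) for x in theta_num[d]]
                 for d in range(N)]                                  # Eq. (12)
    return new_theta, new_beta, new_eta

# Hypothetical toy corpus: 3 documents, 3 words, 2 topics, links 0-1 and 1-2.
counts = [{0: 2, 1: 1}, {0: 1, 1: 2}, {2: 3}]
adj = [{1: 1}, {0: 1, 2: 1}, {1: 1}]
theta = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]
beta = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
eta = [0.1, 0.1]
new_theta, new_beta, new_eta = em_iteration(counts, adj, theta, beta, eta, alpha=0.5)
```

The denominator in Eq. (12) is exactly the sum of the numerators over $z$, so each updated topic mixture remains normalized; the same holds for each row of $\beta$.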



To analyze the running time, let $R_d$ denote the number of distinct
words in document $d$, and let $R = \sum_d R_d$. Then only $KR$ of the
parameters $h_{dw}(z)$ are nonzero. Similarly, $q_{dd'}(z)$ only
appears if $A_{dd'} \ne 0$, so in a network with $M$ links only $KM$
of the $q_{dd'}(z)$ are nonzero. The total number of nonzero terms
appearing in (9)-(12), and hence the running time of the E and M
steps, is thus $O(K(N + M + R))$.
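The counting argument can be made explicit on a toy corpus. The data layout (one dictionary of word counts per document, a list of link pairs) is an assumption of this sketch.

```python
def terms_per_iteration(counts, links, K):
    """Number of nonzero terms per EM iteration: K * (N + M + R), with R = sum_d R_d."""
    N = len(counts)
    M = len(links)
    R = sum(len(c) for c in counts)   # R_d = number of distinct words in d
    return K * (N + M + R)

counts = [{0: 2, 1: 1}, {0: 1, 1: 2}, {2: 3}]   # hypothetical word counts C_dw
links = [(0, 1), (1, 2)]                          # hypothetical links
cost = terms_per_iteration(counts, links, K=2)
```

Note that $R$ counts distinct words per document rather than total words, so repeated words add nothing to the cost.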

As in [2], we can speed up the algorithm if $\theta$ is sparse, i.e.
if many documents belong to fewer than $K$ topics, so that many of the
$\theta_{dz}$ are zero. According to (9), if $\theta_{dz} = 0$ then
$h_{dw}(z) = q_{dd'}(z) = 0$, in which case (12) implies that
$\theta_{dz} = 0$ for all future iterations. If we choose a threshold
below which $\theta_{dz}$ is effectively zero, then as $\theta$
becomes sparser we can maintain just those $h_{dw}(z)$ and
$q_{dd'}(z)$ where $\theta_{dz} \ne 0$. This in turn simplifies the
updates for $\eta$ and $\beta$ in (10) and (11).
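One way to exploit this sparsity is to truncate small entries after each M step and keep a list of active topics per document. The threshold value and the renormalization step below are assumptions of this sketch, not details given in the text.

```python
def sparsify(theta_d, threshold=1e-6):
    """Zero out entries below `threshold`, renormalize, and list the active topics."""
    kept = [t if t >= threshold else 0.0 for t in theta_d]
    total = sum(kept)
    kept = [t / total for t in kept]   # assumption: renormalize after truncation
    active = [z for z, t in enumerate(kept) if t > 0.0]
    return kept, active

theta_d, active = sparsify([0.7, 0.3 - 1e-9, 1e-9])
```

Subsequent E and M steps for this document then only need to loop over `active`, since a zero entry stays zero under the updates.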

We note that the simplicity of our update equations comes from the
fact that $A_{dd'}$ is Poisson, and that its mean is a multilinear
function of the parameters. Models where $A_{dd'}$ is
Bernoulli-distributed with a more complicated link probability, such
as a logistic function, have more complicated derivatives of the
likelihood, and therefore more complicated update equations.

Note also that this EM algorithm is exact, in the sense that the
maximum-likelihood estimators $\{\theta, \beta, \eta\}$ are fixed
points of the update equations. This is because the E step (9) is
exact, since the conditional distribution of topics associated with
each word occurrence and each link is a product distribution, which we
can describe exactly with $h_{dw}$ and $q_{dd'}$. (There are typically
multiple fixed points, so in practice we run our algorithm with many
different initial conditions, and take the fixed point with the
highest likelihood.)

This exactness is due to the fact that the topic mixtures $\theta_d$
are parameters to be inferred. In models such as LDA and MMSB where
$\theta_d$ is a hidden variable integrated over a Dirichlet prior, the
topics associated with each word and link have a complicated joint
distribution that can only be approximated using sampling or
variational methods. (To be fair, recent advances such as stochastic
optimization based on network subsampling [17] have shown that
approximate inference in these models can be carried out quite
efficiently.)

On the other hand, in the context of finding communities in networks,
models with Dirichlet priors have been observed to generalize more
successfully than Poisson models such as BKN [17]. Happily, we can
impose a Dirichlet prior on $\theta_d$ with no loss of efficiency,
simply by including pseudocounts in the update equations, in essence
adding additional words and links that are known to come from each
topic (see the Appendix). This lets us obtain a maximum a posteriori
(MAP) estimate of an LDA-like model. We leave this as a direction for
future work.

3.4 Discrete Labels and Local Search 

Our model, like PLSA and the BKN model, lets us infer a soft
classification: a mixture of topic labels or community memberships for
each document. However, we often want to infer categorical labels,
where each document $d$ is assigned to a single topic
$1 \le z_d \le K$. A natural way to do this is to let $z_d$ be the
most-likely label in the inferred mixture,
$z_d = \operatorname{argmax}_z \theta_{dz}$. This is equivalent to
rounding $\theta_d$ to a delta function, $\theta_{dz} = 1$ for
$z = z_d$ and 0 for $z \ne z_d$.
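Rounding a soft classification to hard labels is a single argmax per document. A brief sketch with hypothetical mixtures (ties are broken in favour of the lowest topic index, which is a choice of this sketch):

```python
def hard_labels(theta):
    """z_d = argmax_z theta_dz for every document d."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]

def one_hot(z_d, K):
    """The delta-function mixture: 1 at z_d and 0 elsewhere."""
    return [1.0 if z == z_d else 0.0 for z in range(K)]

theta = [[0.2, 0.7, 0.1], [0.5, 0.1, 0.4]]
labels = hard_labels(theta)
rounded = [one_hot(z, 3) for z in labels]
```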

If we wish, we can improve these discrete labels further using local
search. If each document has just a single topic, the log-likelihood
of our model is

$$\mathcal{L}_d^{\text{content}} = \sum_{w=1}^{W} C_{dw} \log \beta_{z_d w} , \tag{13}$$

$$\mathcal{L}^{\text{links}} = \frac{1}{2} \sum_{dd'} A_{dd'} \log \eta_{z_d z_{d'}} . \tag{14}$$

Note that here $\eta$ is a matrix, with off-diagonal entries that
allow documents with different topics $z_d, z_{d'}$ to be linked.
Otherwise, these discrete labels would cause the network to split into
$K$ separate components.

Let $n_z$ denote the number of documents of topic $z$, let
$L_z = \sum_{d : z_d = z} L_d$ be their total length, and let
$C_{zw} = \sum_{d : z_d = z} C_{dw}$ be the total number of times $w$
appears in them. Let $m_{zz'}$ denote the total number of links
between documents of topics $z$ and $z'$, counting each link twice if
$z = z'$. Then the MLEs for $\beta$ and $\eta$ are

$$\beta_{zw} = \frac{C_{zw}}{L_z} , \qquad \eta_{zz'} = \frac{m_{zz'}}{n_z n_{z'}} . \tag{15}$$

Applying these MLEs in (13) and (14) gives us a point estimate of the
likelihood of a discrete topic assignment $z_d$, which we can
normalize or reweight as discussed in Section 3.2 if we like. We can
then maximize this likelihood using local search: for instance, using
the Kernighan-Lin heuristic as in [21] or a Monte Carlo algorithm to
find a local maximum of the likelihood in the vicinity of $z$. Each
step of these algorithms changes the label of a single document $d$,
so we can update the values of $n_z$, $L_z$, $C_{zw}$, and $m_{zz'}$
and compute the new likelihood in $O(K + R_d)$ time. In our
experiments we used the KL heuristic, and found that for some data
sets it noticeably improved the accuracy of our algorithm for the
document classification task.
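The point estimate obtained by plugging the MLEs (15) into (13) and (14) can be computed directly from the aggregate counts. The sketch below recomputes everything from scratch for clarity, rather than using the incremental $O(K + R_d)$ update; the dictionary-based data layout and the toy corpus are assumptions. The links term follows Eq. (14) as written; under the MLEs the omitted Poisson mean term sums to a constant, so it does not affect comparisons between labelings.

```python
import math
from collections import Counter, defaultdict

def discrete_loglik(counts, adj, labels):
    """Plug the MLEs of Eq. (15) into Eqs. (13) and (14)."""
    n = Counter(labels)        # n_z
    L = defaultdict(int)       # L_z
    C = defaultdict(int)       # C_zw, keyed by (z, w)
    m = defaultdict(int)       # m_zz', keyed by (z, z')
    for d, z in enumerate(labels):
        for w, c in counts[d].items():
            L[z] += c
            C[(z, w)] += c
        for e, a in adj[d].items():
            # Ordered pairs, so a within-topic link is counted twice.
            m[(z, labels[e])] += a
    content = sum(c * math.log(c / L[z]) for (z, w), c in C.items())
    links = 0.5 * sum(a * math.log(a / (n[z] * n[y])) for (z, y), a in m.items())
    return content, links

# Hypothetical toy corpus: 3 documents, links 0-1 and 1-2, labels 0, 0, 1.
counts = [{0: 2}, {0: 1, 1: 1}, {1: 2}]
adj = [{1: 1}, {0: 1, 2: 1}, {1: 1}]
content, links = discrete_loglik(counts, adj, [0, 0, 1])
```

A local search such as Kernighan-Lin would call a function like this (or its incremental version) to score each candidate relabeling.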

4. EXPERIMENTAL RESULTS 

In this section we present empirical results on our model 
and our algorithm for unsupervised document classification 
and link prediction. We compare its accuracy and running 
time with those of several other methods, testing it on three 
real-world document citation networks. 

4.1 Data Sets 

The top portion of Table 1 lists the basic statistics for three
real-world corpora [25]: Cora, Citeseer, and PubMed^1. Cora and
Citeseer contain papers in machine learning, with K = 7 topics for
Cora and K = 6 for Citeseer. PubMed consists of medical research
papers on K = 3 topics, namely three types of diabetes. All three
corpora have ground-truth topic labels provided by human curators.

The data sets for these corpora are slightly different. The PubMed
data set has the number of times $C_{dw}$ each word appeared in each
document, while the data for Cora and Citeseer records whether or not
a word occurred at least once in the document. For Cora and Citeseer,
we treat $C_{dw}$ as being 0 or 1.

4.2 Models and Implementations 

We compare the Poisson Mixed-Topic Link Model (PMTLM) and its
degree-corrected variant, denoted PMTLM-DC, with PHITS-PLSA, LINK-LDA,
C-PLDC, and RTM (see Section 2.2). We used our own implementation of
both PHITS-PLSA and RTM. For RTM, we implemented the variational EM
algorithm given in [6]. The implementation is based on the LDA code
available from the authors^2. We also tried the code provided by
J. Chang^3, which uses a Monte Carlo algorithm for the E step, but we
found the variational algorithm works better on our data sets. While
RTM includes a variety of link probability functions, we only used the
sigmoid function. We also assume a symmetric Dirichlet prior. The
results for LINK-LDA and C-PLDC are taken from [32].

Each E and M step of the variational algorithm for RTM performs
multiple iterations until they converge on estimates for the posterior
and the parameters [6]. This is quite different from our EM algorithm:
since our E step is exact, we update the parameters only once in each
iteration. In our implementation, the convergence condition for the E
step and for the entire EM algorithm is that the fractional increase
of the log-likelihood between iterations is less than $10^{-?}$; we
performed a maximum of 500 iterations of the RTM algorithm due to its
greater running time. In order to optimize the $\eta$ parameters (see
the graphical model in Section 2.2), RTM uses a tunable regularization
parameter $\rho$, which can be thought of as the number of observed
non-links. We tried various settings for $\rho$, namely $0.1M$,
$0.2M$, $0.5M$, $M$, $2M$, $5M$, and $10M$, where $M$ is the number of
observed links.

Figure 2: The log-likelihood of the PMTLM and PMTLM-DC models as a
function of the number of EM iterations, normalized so that 0 and 1
are the initial and final log-likelihood respectively. The convergence
is roughly the same for all three data sets, showing that the number
of iterations is roughly constant as a function of the size of the
corpus.

^1 These data sets are available for download at
http://www.cs.umd.edu/projects/linqs/projects/lbc/
^2 See http://www.cs.princeton.edu/~blei/lda-c/
^3 See http://www.cs.princeton.edu/~blei/lda/

As described in Section 3.2, for PMTLM, PMTLM-DC and PHITS-PLSA we
vary the relative weight $\alpha$ of the likelihood of the content vs.
the links, tuning $\alpha$ to its best possible value. For the PubMed
data set, we also normalized the content likelihood by the length of
the documents.
                     Cora      Citeseer   PubMed
M                    5,429
W                    1,433
R                    49,216
EM (PLSA)            28        61         362
EM (PHITS-PLSA)      40        67         445
EM (PMTLM)           33        64         419
EM (PMTLM-DC)        36        64         402
EM (RTM)             992       597        2,194
KL (PMTLM)           375       618        13,723
KL (PMTLM-DC)        421       565        13,014

Table 1: The statistics of the three data sets, and 
the mean running time, for the EM algorithms in 
our model PMTLM, its degree-corrected variant 
PMTLM-DC, and PLSA, PHITS-PLSA, and RTM. 
Each corpus has K topics, N documents, M links, a 
vocabulary of size W, and a total size R. Running 
times for our algorithm, PLSA, and PHITS-PLSA 
are given for one run of 5000 EM iterations. Run- 
ning times for RTM consist of up to 500 EM itera- 
tions, or until the convergence criteria are reached. 
Our EM algorithm is highly scalable, with a running 
time that grows linearly with the size of the corpus. 
In particular, it is much faster than the variational
algorithm for RTM. Improving discrete labels with 
the Kernighan-Lin heuristic (KL) increases our al- 
gorithm's running time, but improves its accuracy 
for document classification in Cora and Citeseer. 

For RTM, in each E step, we initialize the variational parameters
randomly, and in each M step we initialize the hyperparameters
randomly. We execute 500 independent runs for each setting of the
tunable parameter $\rho$.



4.3 Document Classification 

4.3.1 Experimental Setting

For PMTLM, PMTLM-DC and PHITS-PLSA, we performed 500 independent runs
of the EM algorithm, each with random initial values of the parameters
and topic mixtures. For each run we iterated the EM algorithm up to
5000 times; we found that it typically converges in fewer iterations,
with the criterion that the fractional increase of the log-likelihood
for two successive iterations is less than $10^{-?}$. Figure 2 shows
that the log-likelihood as a function of the number of iterations is
quite similar for all three data sets, even though these corpora have
very different sizes. This indicates that even for large data sets,
our algorithm converges within a small number of iterations, making
its total running time linear in the size of the corpus.
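The run protocol described here (many random restarts, each iterated until the fractional change in log-likelihood falls below a tolerance) can be sketched generically. The function names, the tolerance value, and the one-dimensional stand-in for the EM update are all hypothetical; a real run would pass the model's EM step and log-likelihood instead.

```python
def run_until_converged(step, loglik, params, tol, max_iters):
    """Iterate `step` until the fractional log-likelihood change drops below `tol`."""
    prev = loglik(params)
    for it in range(1, max_iters + 1):
        params = step(params)
        cur = loglik(params)
        if abs(cur - prev) < tol * abs(prev):
            return params, cur, it
        prev = cur
    return params, prev, max_iters

def best_of_restarts(step, loglik, inits, tol, max_iters):
    """Run from each initial condition and keep the fixed point with the highest likelihood."""
    results = [run_until_converged(step, loglik, p, tol, max_iters) for p in inits]
    return max(results, key=lambda r: r[1])

# Stand-in problem: an update that contracts toward x = 3, with a concave objective.
toy_step = lambda x: (x + 3.0) / 2.0
toy_loglik = lambda x: -1.0 - (x - 3.0) ** 2
best_params, best_ll, iters = best_of_restarts(
    toy_step, toy_loglik, inits=[0.0, 10.0], tol=1e-6, max_iters=5000)
```

Keeping the tolerance and iteration cap as arguments makes it easy to use a looser budget for slower baselines, as was done for RTM.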

For PMTLM and PMTLM-DC, we obtain discrete topic labels by running our
EM algorithm and rounding the topic mixtures as described in Section
3.4. We also tested improving these labels with local search, using
the Kernighan-Lin heuristic to change the label of one document at a
time until we reach a local optimum of the likelihood. More precisely,
of those 500 runs, we took the $T$ best fixed points of the EM
algorithm (i.e., with the highest likelihood) and attempted to improve
them further with the KL heuristic. We used $T = 50$ for Cora and
Citeseer and $T = 5$ for PubMed.



4.3.2 Metrics 

For each algorithm, we used several measures of the accuracy of the
inferred labels as compared to the human-curated ones. The Normalized
Mutual Information (NMI) between two labelings $C_1$ and $C_2$ is
defined as

$$\mathrm{NMI}(C_1, C_2) = \frac{\mathrm{MI}(C_1, C_2)}{\max(\mathrm{H}(C_1), \mathrm{H}(C_2))} . \tag{16}$$

Here $\mathrm{MI}(C_1, C_2)$ is the mutual information between $C_1$
and $C_2$, and $\mathrm{H}(C_1)$ and $\mathrm{H}(C_2)$ are the
entropies of $C_1$ and $C_2$ respectively. Thus the NMI is a measure
of how much information the inferred labels give us about the true
ones. We also used the Pairwise F-measure (PWF) [3] and the Variation
of Information (VI) [25] (which we wish to minimize).
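Eq. (16) is straightforward to compute from label counts. The sketch below uses natural logarithms (the ratio does not depend on the base) and assumes at least one labeling has more than one cluster, so the denominator is nonzero. The example labelings are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a labeling, in nats."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same documents."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def nmi(a, b):
    """Eq. (16): MI(C1, C2) / max(H(C1), H(C2))."""
    return mutual_information(a, b) / max(entropy(a), entropy(b))

truth = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]   # same partition with permuted label names
coarser = [0, 0, 0, 0, 1, 1]   # merges two of the true classes
score_renamed = nmi(truth, renamed)
score_coarser = nmi(truth, coarser)
```

The measure is invariant to renaming the labels, which matters for unsupervised classification where topic indices are arbitrary.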

4.3.3 Results 

The best NMI, VI, and PWF we observed for each algorithm are given in
Table 2, where for LINK-LDA and C-PLDC we quote results from [32]. For
RTM, we give these metrics for the labeling with the highest
likelihood, using the best value of $\rho$ for each metric.

We see that even without the additional step of local search, 
our algorithm does very well, outperforming all other meth- 
ods we tried on Citeseer and PubMed and all but C-pldc 
on Cora. (Note that we did not test link-lda or c-pldc 
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RTM 


0.349 


2.306 


0.422 


0.369 


2.209 


0.480 


0.228 


1.646 


0.482 


PMTLM 


0.467 (.4) 


1.957 (.4) 


0.509 (.3) 


0.399 (.4) 


2.106 (.4) 


0.509 (.3) 


0.232 (.9) 


1.639 (1.0) 


0.486 (.9) 


PMTLM (KL) 


0.514 (.4) 


1.778 (.4) 


0.525 (.4) 


0.414 (.6) 


2.057 (.6) 


0.518 (.5) 


0.233 (.9) 


1.642 (.9) 


0.488 (.9) 


PMTLM-DC 


0.474 (.3) 


1.930 (.3) 


0.498 (.3) 


0.402 (.3) 


2.096 (.3) 


0.518 (.3) 


0.270 (.8) 


1.556 (.8) 


0.496 (.8) 


PMTLM-DC (KL) 


0.491 (.3) 


1.865 (.3) 


0.511 (.3) 


0.406 (.3) 


2.084 (.3) 


0.520 (.3) 


0.260 (.8) 


1.577 (.8) 


0.492 (.8) 



Table 2: The best normalized mutual information (NMI), variation of information (VI) and pairwise
F-measure (PWF) achieved by each algorithm. Values marked by † are quoted from [32]; other values are based
on our implementation. The best values are shown in bold; note that we seek to maximize NMI and PWF,
and minimize VI. For PHITS-PLSA, PMTLM, and PMTLM-DC, the number in parentheses is the best value
of the relative weight $\alpha$ of content vs. links. Refining the labeling returned by the EM algorithm with the
Kernighan-Lin heuristic is indicated by (KL).



on PubMed.) Degree correction (PMTLM-DC) improves accuracy
significantly for PubMed.

Refining our labeling with the KL heuristic improved the 
performance of our algorithm significantly for Cora and Cite- 
seer, giving us a higher accuracy than all the other methods 
we tested. For PubMed, local search did not increase accu- 
racy in a statistically significant way. In fact, on some runs 
it decreased the accuracy slightly compared to the initial 
labeling z obtained from our EM algorithm; this is coun- 
terintuitive, but it shows that increasing the likelihood of a 
labeling in the model can decrease its accuracy. 

In Figure 3 we show how the performance of PMTLM, PMTLM-DC,
and PHITS-PLSA varies as a function of $\alpha$, the relative weight
of content vs. links. Recall that at $\alpha = 0$ these algorithms
label documents solely on the basis of their links, while at
$\alpha = 1$ they only pay attention to the content. Each point
consists of the top 20 runs with that value of $\alpha$.

For Cora and Citeseer, there is an intermediate value of $\alpha$ at
which PMTLM and PMTLM-DC have the best accuracy. However,
this peak is fairly broad, showing that we do not have
to tune $\alpha$ very carefully. For PubMed, where we also normalized
the content information by document length, PMTLM-DC
performs best at a particular value of $\alpha$.

We give the running time of these algorithms, including 
PMTLM and PMTLM-DC with and without the KL heuristic, 
in Table [T] and compare it to the running time of the other 
algorithms we implemented. Our EM algorithm is much
faster than the variational EM algorithm for RTM, and is
scalable in that its running time grows linearly with the size of the corpus.



4.4 Link Prediction

Link prediction (e.g. [8, 23, 34]) is a natural generalization
task in networks, and another way to measure the quality of
our model and our EM algorithm. Based on a training set
consisting of a subset of the links, our goal is to rank all pairs
without an observed link according to the probability of a
link between them. For our models, we rank pairs according
to the expected number of links $A_{dd'}$ in the Poisson distribution,
(2) and (3), which is monotonic in the probability
that at least one link exists.

We can then predict links between those pairs where this
probability exceeds some threshold. Since we are agnostic
about this threshold and about the cost of Type I vs. Type
II errors, we follow other work in this area by defining the
accuracy of our model as the AUC, i.e. the probability that
a random true positive link is ranked above a random true
non-link. Equivalently, this is the area under the receiver
operating characteristic curve (ROC). Our goal is to do better
than the baseline AUC of 1/2, corresponding to a random
ranking of the pairs.

We carried out 10-fold cross-validation, in which the links
in the original graph are partitioned into 10 subsets of
equal size. For each fold, we use one subset as the test
links, and train the model using the links in the other 9
folds. We evaluated the AUC on the held-out links and
the non-links. For Cora and Citeseer, all the non-links are
used. For PubMed, we randomly chose 10% of the non-links
for comparison. We trained the models with the same
settings as those for document classification in Section 4.3;
we executed 100 independent runs for each test. Note that
unlike the document classification task, here we used the full
topic mixtures to predict links, not just the discrete labels
consisting of the most-likely topic for each document.

Note that PMTLM-DC assigns $S_d$ to be zero if the degree of $d$
is zero. This makes it impossible for $d$ to have any test link
with others if its observed degree is zero in the training data.
One way to solve this is to assign a small positive value to
$S_d$ even if $d$'s degree is zero. Our approach assigns $S_d$ to be
the smallest value among those $S_{d'}$ that are non-zero:

$$S_d = \min\{S_{d'} : S_{d'} > 0\} \quad \text{if } \kappa_d = 0 . \tag{17}$$
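
The ranking and evaluation procedure can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' code: the function names and the dense NumPy arrays `theta` (documents by topics), `eta` (one entry per topic) and `S` (one entry per document) are hypothetical, and the pairwise AUC below compares every held-out link with every non-link, which is only practical for small examples.

```python
import numpy as np


def fill_zero_degree(S):
    """Eq. (17): give zero entries of S the smallest positive value in S."""
    S = np.asarray(S, dtype=float).copy()
    positive = S[S > 0]
    if positive.size:                  # leave S unchanged if every entry is zero
        S[S == 0] = positive.min()
    return S


def expected_links(theta, eta, S=None):
    """Poisson mean for every pair (d, d'): sum_z eta_z theta_dz theta_d'z,
    multiplied by S_d S_d' when degree-correction parameters are supplied."""
    rate = (theta * eta) @ theta.T
    if S is not None:
        rate = rate * np.outer(S, S)
    return rate


def auc(pos_scores, neg_scores):
    """Probability that a random held-out link outscores a random non-link.
    Ties are counted as 1/2 (an assumed convention)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))
```

In use, one would fit the model on nine folds, call `expected_links` (after `fill_zero_degree` for the degree-corrected model), and pass the scores of the held-out links and of the non-links to `auc`.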



Figure 4(a) gives the AUC values for PMTLM and PMTLM-DC
as a function of the relative weight $\alpha$ of content vs. links.
The green horizontal line in each of those subplots represents
the highest AUC value achieved by the RTM model for each
data set, using the best value of p among those specified
in Section 4.3.1. Interestingly, for Cora and Citeseer the optimal
value of $\alpha$ is smaller than in Figure 3, showing that content
is less important for link prediction than for document
classification. We also plot the receiver operating characteristic
(ROC) curves and precision-recall curves that achieve
the highest AUC values in Figure 4(b) and Figure 4(c)
respectively.

Figure 3: The accuracy of PMTLM, PMTLM-DC, and PHITS-PLSA on the document classification task,
measured by the NMI, as a function of the relative weight $\alpha$ of the content vs. the links. At $\alpha = 0$ these
algorithms label documents solely on the basis of their links, while at $\alpha = 1$ they pay attention only to the
content. For Cora and Citeseer, there is a broad range of $\alpha$ that maximizes the accuracy. For PubMed, the
degree-corrected model PMTLM-DC performs best at a particular value of $\alpha$.

Figure 4: Performance on the link prediction task. (a) AUC values for different $\alpha$. (b) ROC curves achieving
the highest AUC values. For all three data sets and all the $\alpha$ values, the PMTLM-DC
model achieves higher accuracy than the PMTLM model. In contrast to Figure 3, for this task the optimal
value of $\alpha$ is relatively small, showing that the content is less important, and the topology is more important,
for link prediction than for document classification. The green line in Figure 4(a) indicates the highest AUC
achieved by the RTM model, maximized over the tunable parameter p. Our models outperform RTM on all
three data sets. In addition, the degree-corrected model (PMTLM-DC) does significantly better than the
uncorrected version (PMTLM).

We see that, for all three data sets, our models outperform
RTM, and that the degree-corrected model PMTLM-DC is
significantly more accurate than the uncorrected one.

5. CONCLUSIONS 

We have introduced a new generative model for document
networks. It is a marriage between Probabilistic Latent Semantic
Analysis [R] and the Ball-Karrer-Newman mixed-membership
block model [2]. Because of its mathematical
simplicity, its parameters can be inferred with a particu- 
larly simple and scalable EM algorithm. Our experiments on 
both document classification and link prediction show that 
it achieves high accuracy and efficiency for a variety of data 
sets, outperforming a number of other methods. In future 
work, we plan to test its performance for other tasks includ- 
ing supervised and semisupervised learning, active learning, 
and content prediction, i.e., predicting the presence or ab- 
sence of words in a document based on its links to other 
documents and/or a subset of its text. 
APPENDIX

A. UPDATE EQUATIONS FOR PMTLM 

In this appendix, we derive the update equations (10)-(12)
for the parameters $\eta$, $\beta$, and $\theta$, giving the M step of our
algorithm.

Recall that the likelihood is given by (O and ([8]). For iden- 
tifiability, we impose the normalization constraints 



$$\forall z : \sum_w \beta_{zw} = 1 \tag{18}$$

$$\forall d : \sum_z \theta_{dz} = 1 \tag{19}$$



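For each topic $z$, taking the derivative of the likelihood with
respect to $\eta_z$ gives

$$\frac{2}{1-\alpha} \frac{\partial \mathcal{L}}{\partial \eta_z} = \frac{1}{\eta_z} \sum_{dd'} A_{dd'} q_{dd'}(z) - \sum_{dd'} \theta_{dz} \theta_{d'z} . \tag{20}$$

Thus

$$\eta_z = \frac{\sum_{dd'} A_{dd'} q_{dd'}(z)}{\left( \sum_d \theta_{dz} \right)^2} . \tag{21}$$

Plugging this in to the likelihood makes the last term a constant,
$-\frac{1}{2} \sum_{dd'} A_{dd'} = -M$. Thus we can ignore this term when
estimating $\theta_{dz}$.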
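
To see why this term is constant, the step can be spelled out as follows. This is our own worked expansion, assuming that the last term of the likelihood is $-\frac{1}{2} \sum_{dd'z} \eta_z \theta_{dz} \theta_{d'z}$ (up to the overall factor $1-\alpha$), that the E-step distribution is normalized so that $\sum_z q_{dd'}(z) = 1$, and that $M$ denotes the number of links, so that $\sum_{dd'} A_{dd'} = 2M$.

```latex
\begin{align*}
-\frac{1}{2} \sum_{dd'z} \eta_z \theta_{dz} \theta_{d'z}
  &= -\frac{1}{2} \sum_z \eta_z \Bigl( \sum_d \theta_{dz} \Bigr)^2 \\
  &= -\frac{1}{2} \sum_z \sum_{dd'} A_{dd'} q_{dd'}(z)
     && \text{substituting the update for } \eta_z \\
  &= -\frac{1}{2} \sum_{dd'} A_{dd'}
     && \text{since } \textstyle\sum_z q_{dd'}(z) = 1 \\
  &= -M .
\end{align*}
```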

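Similarly, for each topic $z$ and each word $w$, taking the
derivative with respect to $\beta_{zw}$ gives

$$\nu_z = \frac{\partial \mathcal{L}}{\partial \beta_{zw}} = \frac{\alpha}{\beta_{zw}} \sum_d \frac{1}{L_d} C_{dw} h_{dw}(z) , \tag{22}$$

where $\nu_z$ is the Lagrange multiplier for (18). Normalizing
$\beta_z$ determines $\nu_z$, and gives

$$\beta_{zw} = \frac{\sum_d (1/L_d) C_{dw} h_{dw}(z)}{\sum_d (1/L_d) \sum_{w'} C_{dw'} h_{dw'}(z)} . \tag{23}$$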



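Finally, for each document $d$ and each topic $z$, taking the
derivative with respect to $\theta_{dz}$ gives

$$\lambda_d = \frac{\partial \mathcal{L}}{\partial \theta_{dz}} = \frac{\alpha}{L_d \theta_{dz}} \sum_w C_{dw} h_{dw}(z) + \frac{1-\alpha}{\theta_{dz}} \sum_{d'} A_{dd'} q_{dd'}(z) , \tag{24}$$

where $\lambda_d$ is the Lagrange multiplier for (19). Normalizing
$\theta_d$ determines $\lambda_d$ and gives

$$\theta_{dz} = \frac{(\alpha/L_d) \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z)}{\alpha + (1-\alpha) \kappa_d} . \tag{25}$$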
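
The three updates of this appendix can be sketched in a few lines of NumPy. This is an illustrative sketch under our own assumptions, not the authors' implementation: it uses dense arrays (a scalable version would iterate only over nonzero word counts and links), the function and argument names are hypothetical, the E-step posteriors `h` and `q` are taken as given and normalized over topics, every document is assumed to be non-empty, and we assume $\eta$ is computed from the freshly updated $\theta$.

```python
import numpy as np


def m_step(C, A, h, q, alpha):
    """One M step for PMTLM: eta, beta, theta from the E-step posteriors.

    C: (D, W) word counts            A: (D, D) symmetric link counts
    h: (D, W, Z) word-topic posteriors, summing to 1 over z
    q: (D, D, Z) link-topic posteriors, summing to 1 over z
    alpha: relative weight of content vs. links
    """
    L = C.sum(axis=1)                    # document lengths L_d (assumed > 0)
    kappa = A.sum(axis=1)                # degrees kappa_d
    # word[d, z] = (1/L_d) sum_w C_dw h_dw(z)
    word = np.einsum('dw,dwz->dz', C, h) / L[:, None]
    # link[d, z] = sum_d' A_dd' q_dd'(z)
    link = np.einsum('de,dez->dz', A, q)

    # theta update: weighted word and link counts, normalized per document
    theta = (alpha * word + (1 - alpha) * link) / (
        alpha + (1 - alpha) * kappa)[:, None]

    # beta update: length-normalized word counts per topic, normalized over words
    beta = np.einsum('dw,dwz->zw', C / L[:, None], h)
    beta /= beta.sum(axis=1, keepdims=True)

    # eta update: expected links per topic over (sum_d theta_dz)^2
    eta = link.sum(axis=0) / theta.sum(axis=0) ** 2
    return eta, beta, theta
```

Because `h` and `q` are normalized over topics, each row of the returned `theta` and `beta` sums to one, so the constraints (18) and (19) hold without an explicit renormalization of `theta`.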



B. UPDATE EQUATIONS FOR THE DEGREE- 
CORRECTED MODEL 

Recall that in the degree-corrected model PMTLM-DC, the
number of links between each pair of documents $d, d'$ is
Poisson-distributed with mean

$$S_d S_{d'} \sum_z \eta_z \theta_{dz} \theta_{d'z} . \tag{26}$$

To make the model identifiable, in addition to (18) and (19),
we impose the following constraint on the degree-correction
parameters:

$$\forall z : \sum_d S_d \theta_{dz} = 1 . \tag{27}$$

With this constraint, we have

$$\mathcal{L} = \alpha \sum_d \frac{1}{L_d} \sum_{wz} C_{dw} h_{dw}(z) \log \frac{\theta_{dz} \beta_{zw}}{h_{dw}(z)} + (1-\alpha) \sum_d \kappa_d \log S_d + \frac{1-\alpha}{2} \sum_{dd'z} \left( A_{dd'} q_{dd'}(z) \log \frac{\eta_z \theta_{dz} \theta_{d'z}}{q_{dd'}(z)} - S_d S_{d'} \eta_z \theta_{dz} \theta_{d'z} \right) . \tag{28}$$

The update equation (23) for $\beta$ remains the same, since the
degree-correction only affects the part of the model that generates
the links, not the words. We now derive the update
equations for $\eta$, $S$, and $\theta$.

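For each topic $z$, taking the derivative of the likelihood with
respect to $\eta_z$ gives

$$\frac{2}{1-\alpha} \frac{\partial \mathcal{L}}{\partial \eta_z} = \frac{1}{\eta_z} \sum_{dd'} A_{dd'} q_{dd'}(z) - \sum_{dd'} S_d S_{d'} \theta_{dz} \theta_{d'z} = \frac{1}{\eta_z} \sum_{dd'} A_{dd'} q_{dd'}(z) - 1 , \tag{29}$$

where we used (27). Thus

$$\eta_z = \sum_{dd'} A_{dd'} q_{dd'}(z) , \tag{30}$$

so $\eta_z$ is simply the expected number of links caused by topic
$z$. In particular,

$$\sum_z \eta_z = \sum_{dd'} A_{dd'} = \sum_d \kappa_d . \tag{31}$$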



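For $S_d$, we have

$$\frac{1}{1-\alpha} \frac{\partial \mathcal{L}}{\partial S_d} = \frac{\kappa_d}{S_d} - \sum_{d'z} S_{d'} \eta_z \theta_{dz} \theta_{d'z} = \frac{\kappa_d}{S_d} - \sum_z \eta_z \theta_{dz} = \sum_z \xi_z \theta_{dz} , \tag{32}$$

where $\xi_z$ is the Lagrange multiplier for (27). Thus

$$S_d = \frac{\kappa_d}{\sum_z (\eta_z + \xi_z) \theta_{dz}} . \tag{33}$$

We will determine $\xi_z$ below. However, note that multiplying
both sides of (32) by $S_d$, summing over $d$, and applying (27)
and (31) gives

$$\sum_z \xi_z = 0 . \tag{34}$$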



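Most importantly, for $\theta$ we have

$$\frac{\partial \mathcal{L}}{\partial \theta_{dz}} = \frac{1}{\theta_{dz}} \left( \frac{\alpha}{L_d} \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z) \right) - (1-\alpha) \sum_{d'} S_d S_{d'} \eta_z \theta_{d'z}$$

$$= \frac{1}{\theta_{dz}} \left( \frac{\alpha}{L_d} \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z) \right) - (1-\alpha) S_d \eta_z = \lambda_d + (1-\alpha) S_d \xi_z , \tag{35}$$

where $\lambda_d$ is the Lagrange multiplier for (19), and where we
applied (27) in the second equality. Multiplying both sides
of (35) by $\theta_{dz}$, summing over $z$, and applying (33) gives

$$\lambda_d = \alpha . \tag{36}$$

Multiplying both sides of (35) by $\theta_{dz}$, summing over $d$ instead,
and applying (27), (30), and (36) gives

$$\frac{1-\alpha}{\alpha} \xi_z = \sum_d \frac{1}{L_d} \sum_w C_{dw} h_{dw}(z) - \sum_d \theta_{dz} = \sum_d \frac{1}{L_d} \sum_w C_{dw} \left( h_{dw}(z) - \theta_{dz} \right) . \tag{37}$$

Thus $\xi_z$ measures how the inferred topic distributions of the
words $h_{dw}(z)$ differ from the topic mixtures $\theta_{dz}$.

Finally, (35) and (36) give

$$\theta_{dz} = \frac{(\alpha/L_d) \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z)}{\alpha + (1-\alpha) (\eta_z + \xi_z) S_d} , \tag{38}$$

where $\eta_z$ and $\xi_z$ are given by (30) and (37).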
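
As a rough sketch of how the degree-corrected updates fit together, the function below evaluates (30), (37), (33), and (38) in that order. It is written under our own assumptions rather than taken from the authors' code: arrays are dense, the names are hypothetical, $\alpha < 1$, the current $\theta$ is used on the right-hand sides, and no safeguard is included for documents where the denominator of (33) is not positive.

```python
import numpy as np


def m_step_dc(C, A, h, q, theta, alpha):
    """Degree-corrected updates for eta, xi, S and theta from the current theta.

    C: (D, W) word counts            A: (D, D) symmetric link counts
    h: (D, W, Z), q: (D, D, Z) E-step posteriors, normalized over z
    theta: (D, Z) current topic mixtures, rows summing to 1
    alpha: relative weight of content vs. links, assumed < 1
    """
    L = C.sum(axis=1)                    # document lengths L_d (assumed > 0)
    kappa = A.sum(axis=1)                # degrees kappa_d
    word = np.einsum('dw,dwz->dz', C, h) / L[:, None]
    link = np.einsum('de,dez->dz', A, q)

    # Eq. (30): expected number of links caused by each topic
    eta = link.sum(axis=0)
    # Eq. (37): sum_w C_dw / L_d = 1, so the theta term reduces to sum_d theta_dz
    xi = alpha / (1 - alpha) * (word - theta).sum(axis=0)
    # Eq. (33)
    S = kappa / (theta @ (eta + xi))
    # Eq. (38)
    theta_new = (alpha * word + (1 - alpha) * link) / (
        alpha + (1 - alpha) * np.outer(S, eta + xi))
    return eta, xi, S, theta_new
```

With normalized inputs, the returned `xi` sums to zero and `eta` sums to the total of `A`, matching (34) and (31); this gives a cheap sanity check on an implementation.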



C. UPDATE EQUATIONS WITH DIRICHLET 
PRIOR 

If we impose a Dirichlet prior on $\theta$, with parameters $\{\gamma_z\}$ for
each topic $z$, this gives an additional term $\sum_{dz} (\gamma_z - 1) \log \theta_{dz}$
in the log-likelihood of both the PMTLM and PMTLM-DC models.
This is equivalent to introducing pseudocounts $t_z = \gamma_z - 1$
for each $z$, which we can think of as additional words
or links that we know are due to topic $z$. Our original models,
without this term, correspond to the uniform prior with
$\gamma_z = 1$ and $t_z = 0$. However, as long as $\gamma_z \geq 1$ so that the
pseudocounts are nonnegative, we can infer the parameters
of our model in the same way with no loss of efficiency.



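In the PMTLM model, (25) becomes

$$\theta_{dz} = \frac{t_z + (\alpha/L_d) \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z)}{\sum_{z'} t_{z'} + \alpha + (1-\alpha) \kappa_d} . \tag{39}$$

In the degree-corrected model PMTLM-DC, (36) and (37) become

$$\lambda_d = \alpha + \sum_z t_z \tag{40}$$

and

$$\frac{1-\alpha}{\alpha} \xi_z = \sum_d \frac{1}{L_d} \sum_w C_{dw} \left( h_{dw}(z) - \theta_{dz} \right) + \frac{1}{\alpha} \sum_d \Bigl( t_z - \theta_{dz} \sum_{z'} t_{z'} \Bigr) . \tag{41}$$

Note that $\xi_z$ has two contributions. One measures, as before,
how the inferred topic distributions of the words $h_{dw}(z)$ differ
from the topic mixtures $\theta_{dz}$, and the other measures how
the fraction $t_z / \sum_{z'} t_{z'}$ of pseudocounts for topic $z$ differs
from $\theta_{dz}$.

Finally, (38) becomes

$$\theta_{dz} = \frac{t_z + (\alpha/L_d) \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z)}{\alpha + (1-\alpha) (\eta_z + \xi_z) S_d + \sum_{z'} t_{z'}} , \tag{42}$$

where $\eta_z$ and $\xi_z$ are given by (30) and (41).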
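
The effect of the pseudocounts on the PMTLM update is easy to see in code. The sketch below implements (39) only, assuming the per-document sums have already been accumulated; the function and argument names are ours and purely illustrative.

```python
import numpy as np


def theta_update_with_prior(word, link, kappa, t, alpha):
    """PMTLM theta update with Dirichlet pseudocounts t_z = gamma_z - 1 >= 0.

    word[d, z] = (1/L_d) sum_w C_dw h_dw(z)   (each row sums to 1)
    link[d, z] = sum_d' A_dd' q_dd'(z)        (row d sums to kappa_d)
    kappa[d]   = degree of document d
    """
    word = np.asarray(word, dtype=float)
    link = np.asarray(link, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    t = np.asarray(t, dtype=float)
    numer = t + alpha * word + (1 - alpha) * link      # t_z broadcasts over documents
    denom = t.sum() + alpha + (1 - alpha) * kappa      # one normalizer per document
    return numer / denom[:, None]
```

Setting every entry of `t` to zero recovers the update without a prior, and larger pseudocounts pull each document's mixture toward the proportions $t_z / \sum_{z'} t_{z'}$.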



